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Anomaly detection of HTTP requests for outlier detection. Then anomalies in number of HTTP 





Log analysis requests are detected using three techniques namely InterQuartileRange 
Log records method, Moving averages and Median Absolute deviation. Once the outliers 
Neural network are detected, these outliers are removed from the current dataset. This output 
is given as input to the Multilayer Perceptron model to predict the number of 
HTTP requests at the next timestamp. This paper presents a web based model 
to automate the process of anomaly detection in log files. 
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1. INTRODUCTION 

A large number of communication logs are generated by corporate systems. These are used to 
monitor the internet traffic that has been flowing towards a particular source. They produce a large number of 
logs which are then collected and stored. These logs are scraped to extract the particular point of interest and 
the associated timestamp which is converted to timeseries dataset. Such type of data is then used by 
developers and programmers to find any anomalies within the extracted time series dataset. Such anomalies 
cannot be detected manually. There is a requirement to automate the anomaly detection for which large and 
complex systems are built. These systems scan the datasets for any anomalies involving any suspicious or 
interesting part. A large number of traffic anomalies including attacks, large file transfers, flash crowds and 
outages occur fairly frequently. Large enterprise networks have dedicated security operations for 
continuously monitoring the network traffic in order to detect, identify and take action on the anomalies that 
occur in log files. On a smaller scale, these operations are handled by network administrators who can also 
simultaneously work on other day-to-day maintenance operations and planning activities. Although there has 
been a recent growth in network monitoring [1], log analysis [2] and intrusion detection systems [3-4], but it 
is still challenging to correctly detect various types of anomalies. This paper discusses the realization of web 
based framework to detect outliers in terms of number of HTTP requests for the input IP address using three 
algorithms namely Inter-Quartile-Range, Moving Averages, Moving median absolute deviation. Thereafter a 
multilayer perceptron model is built to predict the outliers for a particular time period. This paper focus on 
three main stages in log analysis namely outlier detection, outlier removal and outlier prediction for the given 
log file. 
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2. STATE OF ART DEVELOPMENT 

In the last decade lot of work has been carried out on the outlier detection from log records. 
Recently anomaly detection based systems are also used for cyber-intrusion detection [5]. Anomalies can 
even be generated in payment security domain. All the payment transaction done generate certain type of log 
files through which anomalies can be found such as risk factor, fraud detection etc [6]. 

There are different tools available in market which helps us to detect anomaly. One such tool is 
loggly [7-8]. Loggly’s anomaly detection framework helps in finding fluctuations in event frequency. 
Anomalies may be of various types such as a spike in number of requests etc. The trend chart plotted by 
loggly’s anomaly detection framework allows choosing any field on which the analysis has to be done. It 
shows the fluctuations in the frequencies of the choosen fileds. It also shows changes with reapect to current 
timestamp, the background timestamp and brings the field on top of list with most fluctuations. 

Another existing tool is Nagios [9] which also alerts the development team about the various 
anomalies. Nagios is able to detect different anomalies such as memory usage, disk space usage, port 
connectivity, whether a process is running or not, etc. For all the anomalies detected it sends out the alert via 
notification system. In this type of system generally alerts are generated only when a particular service is 
down. It cannot predict which service may go down in future. 

Every day, large amount of enterprise data is generated in the form of time-stamped logs from 
network devices, security appliances, servers, endpoints, applications, users and so on. The required 
knowledge to efficiently manage and secure IT infrastructures is hidden in this data, but it’s impractical for 
humans to extract this information. Prelert [10] is an anomaly detection engine. It analyzes the data and 
detects anomalies. Then relate them together and provides the information about advanced threat activity and 
any problems related to IT operations. But Prelert is not an open source tool, it is an enterprise application. 


2.1. Contribution towards Framework for Anomaly detectIon in Log Records 
The contributions towards the development of framework for anomaly detection in log files are 

listed below: 

e The proposed tool provides easy to use web interface where users need to upload the log files. 

e For anomaly detection, configurable parameters are provided in the web interface such as window size for 
median, absolute deviation and moving averages. 

e Provision is given to select the IP address for which a user wants to detect and predict anomalies using 
drop-down menu. 

e The prediction models are saved to disk which reduces memory usage. It consists of activation fucntion 
used, input dimension to neural network etc. 

e The proposed tool supports web based interface. 


3. PROPOSED WORK 

The developed framework for log analysis is currently being used by Cisco; it is developed as a web 

based application with the following design goals: 

e Itis light weight framework by supporting minimum dependencies. 

e It makes use of micro framework for server side processing. 

e It also involves the use of Keras [11], a high-level neural networks API, written in Python and capable of 
running on top of either TensorFlow or Theano. 

e The tool also supports the plotting of anomaly detections as well as predictions with the help of google 
charts API. 

The end user must upload the log file, by processing in the background with the help of Regular 
Expressions, all the IP addresses contained in the log file are populated in the drop-down menu in the web 
interface. The user must select the IP for which he wants the outliers to be plotted. Currently the outliers are 
detected in terms of number of HTTP requests for that IP address. Then user must select the algorithms or 
methods for detecting the outliers from Inter-Quartile-Range, Moving Averages, Moving median absolute 
deviation along with their tuning parameters from web interface respectively. After the outlier detection 
phase the user has the option to train the Multilayer perceptron model for epochs provided by user from web- 
interface. Once the model is trained, it is persisted to the disk for future reference. Then the user is provided 
an option to predict the outliers for a particular time period, which is given as input by user through web- 
interface. This paper realizes the log analysis in three stages. First stage focus on outlier detection, second 
stage focus on removal of outliers, third stage deals with outlier prediction. 


Indonesian J Elec Eng & Comp Sci, Vol. 10, No. 1, April 2018 : 343 - 347 


Indonesian J Elec Eng & Comp Sci ISSN: 2502-4752 0 345 


3.1. Outlier Detection 

The main objective of this module is to detect the outliers in terms of number of HTTP requests. 
This module makes use of three algorithms namely Interquartile range, Moving averages and Median 
Absolute deviation. The Interquartile range (IQR) is a measure of variability. It divides the given dataset into 
quartiles. It is a measure of statistical depression. These are used to divide the dataset into 4 equal parts called 
quartiles. The values used to divide each quartile are called the first, second, and third quartiles; and they are 
denoted by Q1, Q2, and Q3, respectively. Q1 is the "middle" value in the first half of the data set. Q2 is the 
median value in the given data set. Q3 is the "middle" value in the second half of the data set. To calculate 
the interquartile one has to compute the difference between Q3 and Q1. For this algorithm the parameter 
named alpha needs to be selected in the range from 0 to 10. Alpha parameter is required for making the upper 
and lower quartile normalized to simplify the calculation. 

In order to detect outliers in the given one dimensional data the data points are marked and are 
calculated based on its standard deviations. But the presence of outliers causes a profound effect on the mean 
and standard deviations and thus the direct use of such a naive technique is not possible. The median absolute 
deviation (MAD) is one of the solutions to handle the variability of a univariate sample of quantitative data. 
For instance, it can also refer to the population parameter. This is estimated by the median absolute deviation 
which is computed from a sample. For this algorithm window size needs to be adjusted between 0 and 1000. 

In the moving average method which is also referred as rolling average or running average, analysis 
of data points is carried out by generating a series of averages on different subsets of the complete data set. It 
is a type of finite impulse response filter. Some variations include: simple, and cumulative, or weighted 
forms (described below).For instance, given a series of numbers and a fixed window size, the first element of 
the moving average is obtained by taking the average of the initial fixed window size of the number series. 
Then in the further steps the new average is calculated by moving the window forward. Thereby, the first 
element is excluded while calculating and this is continued for all such elements. By using one of the 
methods outliers are plotted with help of google charts API. 


3.2. Removal of Outlier 

e For the removal of outliers, first the mean of the given dataset is computed. While computing the mean 
exclude the points marked as anomalous in the previous outlier detection step. 

e The mean is calculated using the data collected either from inter-quartile methodology or moving average 
approach. 

e The selection of the median or average can be given as an option to the system which normalizes all the 
outlier values in correspondence to the given average. 

e The outlier data points are then modified and normalized using these values. After this the mean and 
deviation of the data is recalculated and kept for the analytics purposes. 


3.3 Outlier Prediction 

The main objective of this module is to predict the outliers with respect to number of HTTP requests 
corresponding to a given IP address. In prediction module the Multilayer perceptron model is used with the 
help of keras library. A multilayer perceptron (MLP) is a feedforward artificial neural network model that 
maps sets of input data onto a set of appropriate outputs. In times series data, the sequence of values is 
important. First the dataset is split into training and testing dataset which is in the ratio of 7:3. Now the next 
step is to fit the multilayer perceptron model to the training data. The optimizer being used is RMSPROP 
(Root Mean Square). After the fitting of data, from test data the values are predicted and plotted using google 
charts API. 

The algorithm takes in a look-back value which defines the number of points used in predicting the 
value at any point t. The training is done on the dataset and new values can then be forecasted by the system. 
For training, the user can input the number of epochs (iterations) the model has to run upon to improve 
efficiency. The loss values for each epoch are also calculated as root mean squared error values and gives an 
idea to the user about the accuracy of the model on that dataset. 


4. RESULTS AND ANALYSIS 

The main objective of the discussed tool is to analyze the log files and detect and predict outliers in 
a very simple way. The visualization of outlier detection and prediction are shown using google charts API. 
The tool supports web enabled single page application. The advantage of using the current paradigm is that it 
does not allow the web pages to be rendered from server side. The REST API’s are developed for the 
prototype which sends response as JSON object. The view layer is decoupled from model layer. The view 
layer only consumes the response in JSON from model layer. 
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The model is trained on the extracted time series dataset. The x-axis of the graph shown in Figure | 
indicates the increasing timestamp range and the y-axis indicates the number of request hits at that 
timestamp. The blue region shows the training part of the data and the green part of the graph is used to find 
the accuracy of the given model by treating it as test data. The proposed model uses RMSprop optimizer as it 
provides better results by keeping balance between the speed and accuracy of model training. RMSprop 
divides the learning rate by an exponentially decaying average of squared gradients. 











Figure 1. Sample Graph for Predicted Outliers Plotted in Green Color 


After the model is trained, it is saved and can be further used for prediction. A sample graph for 
predicted values is plotted in red color as shown in Figure 2. The x-axis of the graph indicates the increasing 
timestamp range and the y-axis indicates the number of request hits at that timestamp. The user can input the 
number of HTTP requests to be forecasted and a graph is displayed in which the green dots gives the original 
request hits value and the red dots provides the predicted request hits. By using this model the root mean 
square error was reduced to significant level. 


Fequest counts vs timestamp 
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Figure 2. Sample Graph with Predicted Values Plotted in Red Color 


5. CONCLUSION 

This paper summarizes the design and development of framework for detecting and predicting 
anomalies in log records in apache format. The current tool provides a simple user interface. The tool 
successfully plots the outliers detected using three methods namely interquartile ranges, moving averages or 
median absolute deviation along with the adjustment of the tuning parameters. The tool also gives the user to 
train the model for a particular number of epochs. After training the model is persisted to disk which reduces 
the training of model repeatedly. This significantly reduces the CPU utilization time. The predicted values 
are visualized using google charts API for a given time period provided by user through web interface. The 
major limitation of this framework is that it supports log files in standard apache format. 
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